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Evolution of gene regulation is crucial for our understanding of the phenotypic differences be¬ 
tween species, populations and individuals. Sequence-specific binding of transcription factors to 
the regulatory regions on the DNA is a key regulatory mechanism that determines gene expression 
and hence heritable phenotypic variation. We use a biophysical model for directional selection on 
gene expression to estimate the rates of gain and loss of transcription factor binding sites (TFBS) 
in finite populations under both point and insertion/deletion mutations. Our results show that 
these rates are typically slow for a single TFBS in an isolated DNA region, unless the selection is 
extremely strong. These rates decrease drastically with increasing TFBS length or increasingly spe¬ 
cific protein-DNA interactions, making the evolution of sites longer than ~ 10 bp unlikely on typical 
eukaryotic speciation timescales. Similarly, evolution converges to the stationary distribution of 
binding sequences very slowly, making the equilibrium assumption questionable. The availability of 
longer regulatory sequences in which multiple binding sites can evolve simultaneously, the presence 
of “pre-sites” or partially decayed old sites in the initial sequence, and biophysical cooperativity 
between transcription factors, can all facilitate gain of TFBS and reconcile theoretical calculations 
with timescales inferred from comparative genomics. 


Author Summary 

Evolution has produced a remarkable diversity of living forms that manifests in qualitative differences as well 
as quantitative traits. An essential factor that underlies this variability is transcription factor binding sites, short 
pieces of DNA that control gene expression levels. Nevertheless, we lack a thorough theoretical understanding of 
the evolutionary times required for the appearance and disappearance of these sites. By combining a biophysically 
realistic model for how cells read out information in transcription factor binding sites with model for DNA sequence 
evolution, we explore these timescales and ask what factors crucially affect them. We find that the emergence of 
binding sites from a random sequence is generically slow under point and insertion/deletion mutational mechanisms. 
Strong selection, sufficient genomic sequence in which the sites can evolve, the existence of partially decayed old 
binding sites in the sequence, as well as certain biophysical mechanisms such as cooperativity, can accelerate the 
binding site gain times and make them consistent with the timescales suggested by comparative analyses of genomic 
data. 


INTRODUCTION 

Evolution produces heritable phenotypic variation within and between populations and species on relatively short 
timescales. Part of this variation is due to differences in gene regulation, which determines how much of each 
gene product exists in every cell. These gene expression levels are heritable quantitative traits subject to natural 
selection m- While the importance of their variability for the observed phenotypic variation is still debated [1] , it is 
believed to be crucial within closely related species or in populations whose proteins are functionally or structurally 
similar [5]. The genetic basis for gene expression differences is thought to be non-coding regulatory DNA, but our 
understanding of its evolution is still immature; this is due, in part, to the lack of precise knowledge about the mapping 
between the regulatory sequence and the resulting expression levels. 

Transcriptional regulation is the most extensively studied mechanism of gene regulation. Transcription factor 
proteins (TFs) recognize and bind specihc DNA sequences called binding sites, thereby affecting the expression of 
target genes. Eukaryotic regulatory sequences, i.e., enhancers and promoters, are typically between a hundred and 
several thousand base pairs (bp) in length [6], and can harbor many transcription factor binding sites (TFBSs), each 
typically consisting of 6 — 12 bp. The situation is different in prokaryotes: they lack enhancer regions and have one or 
a few TFBSs which are typically longer, between 10 to 20 bp in length BIB]. Differences in TF binding are thought to 
arise primarily due to changes in the regulatory sequence at the TF binding sites rather than changes in the cellular 
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environment or the TF proteins themselves m- Nevertheless, a theoretical understanding of the relationship between 
the evolution of the regulatory sequence and the evolution of gene expression levels remains elusive, mostly because 
of the complex interaction of evolutionary forces and biophysical processes m- 

From the evolutionary perspective, the crucial question is whether and when these regulatory sequences can evolve 
rapidly enough so that new phenotypic variants can arise and fix in the population over typical speciation timescales. 
Comparative genomic studies in eukaryotes provide evidence for the evolutionary dynamics of TF binding, highlighting 
the possibility for rapid and flexible TFBS gain and loss between closely related species on timescales of as little as a 
few million years [HI [ISj . Examples include quick gain and loss events that cause divergent gene expression [T4| , or the 
compensation of such events by turn-over at other genome locations gain and loss events sometimes occur even 
in the presence of strong constraints on expression levels [mm- Furthermore, such events enabled new binding sites 
on sex chromosomes that arose as recently as 1 — 2 million years ago mills]. There are examples of rapid regulatory 
DNA evolution across and within populations requiring shorter timescales, i.e. 10.000— 100.000 years pi fSOKH] . On 
the other hand, strict conservation has also been observed at orthologous regulatory locations even in distant species 
(e.g., [25). Taken together, these facts suggest that the rates of TFBS evolution can extend over many orders of 
magnitude and differ greatly from the point mutation rate at a neutral site. To study the evolutionary dynamics of 
regulatory sequences and understand the relevant timescales, we set up a theoretical framework with a special focus 
on the interplay of both population genetic and biophysical factors, briefly outlined below. 

Sequence innovations originate from diverse mutational mechanisms in the genome. While tandem repeats [24j or 
transposable elements [25] may be important in evolution, the better studied and more widespread mutation types still 
need to be better understood in the context of TFBS evolution. Specifically, we ask how the evolutionary dynamics 
are affected by single nucleotide (point) mutations, as well as by insertions and deletions (indels). New mutations in 
the population are selected or eliminated by the combined effects of selection and random genetic drift. Although the 
importance of selection [2HH25 and mutational closeness of the initial sequences [221120] for TF binding site evolution 
has already been reported, the belief in fast evolution via point mutations without selection (i.e., neutral evolution) 
persists in the literature (e.g., [ansi), mainly due to Stone & Wray’s (2001) misinterpretation of their own simulation 
results [3T| (see Macarthur & Brookfield (2004) [ 20 ] )■ This likely reflects the current lack of theoretical understanding 
of TFBS evolution in the literature, even under the simplest case of directional selection. Basic population genetics 
shows that directional selection is expected to cause a change, e.g., yield a functional binding site, over times on 
the order of l/{NsUb), where N is the population size, s is the selection advantage of a binding site, and Ub is the 
beneficial mutation rate |32| . This process can be extremely slow, especially under neutrality, if several mutational 
steps are needed to reach a sequence with sufficient binding energy to confer a selective advantage. As already pointed 
out by Berg et al. (2004) [32] i this places strong constraints on the length of the binding sites, if they were to evolve 
from random sequences. 

Several biophysical factors, such as TF concentration and the energetics of TF-DNA and TF-TF interactions, might 
play an important role in TFBS evolution. Quantitative models for TF sequence specificity [33U38] and for thermo¬ 
dynamic (TD) equilibrium of TF occupancy on DNA [M] [35M5] were developed in recent decades and, in parallel 
with developments in sequencing, have contributed to our understanding of TF-DNA interaction biophysics. These 
biophysical factors can shape the characteristics of the TFBS fitness landscape over genotype space in evolution¬ 
ary models [i 121 m 122H2Z]. There are also intensive efforts to understand the mapping from promoter/enhancer 
sequences to gene expression [H 148115(1] . Despite this recent attention, there have been relatively few attempts to 
understand the evolutionary dynamics of TFBS in full promoter/enhancer regions [2911431 [5TH53] . especially using bio- 
physically realistic but still mathematically tractable models. Such models are necessary to gain a thorough theoretical 
understanding of binding site evolution. 

Our aim in this study is to investigate the dynamics of TFBS evolution by focusing on the typical evolutionary 
rates for individual TFBS gain and loss events. We consider both a single binding site at an isolated DNA region 
and a full enhancer/promoter region, able to harbor multiple binding sites. In the following section, we lay out our 
modeling framework, which covers both population genetic and biophysical considerations, as outlined above. Using 
this framework, we try to understand i) what typical gain and loss rates are for a single TFBS site; ii) how quickly 
populations converge to a stationary distribution for a single TFBS; iii) how multiple TFBS evolve in enhancers 
and promoters; iv) how early history of the evolving sequences can change the evolutionary rates of TFBS; and v) 
how cooperativity between TFs affects the evolution of gene expression. We find that, under realistic parameter 
ranges, both gain and loss of a single binding site is slow, slower than the typical divergence time between species. 
Importantly, fast emergence of an isolated TFBS requires strong selection and favorable initial sequences in the 
mutational neighborhood of a strong TFBS. The evolutionary process approaches the equilibrium distribution very 
slowly, raising concerns about the use of equilibrium assumptions in theoretical work. We proceed to show that the 
dynamics of TFBS evolution in larger sequences can be understood approximately from the dynamics of single binding 
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sites; the TFBS gain times are again slow if evolution starts from random sequence in the absence of strong selection 
or large regulatory sequence “real estate.” Finally, we identify two factors that can speed up the emergence of TFBS: 
the existence of an initial sequence distribution biased towards the mutational neighborhood of strongly binding 
sequences, which suggests that ancient evolutionary history can play a major role in the emergence of “novelties” |54j : 
and the biophysical cooperativity between transcription factors, which can partially account for the lack of observed 
correlation between identifiable binding sequences and transcriptional activity m- 


MODELS & METHODS 


Population genetics 


We consider a finite population of N diploid individuals whose genetic content consists of an evolvable L base pair 
(bp) contiguous regulatory sequence cr to which TFs can bind. Given that cr^ G {A, C, G, T} where i = 1, 2, ..., L 
indexes the position in regulatory sequence, there are 4^ different regulatory sequences in the genotype space. Each 
TF is assumed to bind to a contiguous sequence of n bp within our focal region of L bp (Fig. ,B). Regulatory 
sequences evolve under mutation, selection, and sampling drift. The rest of the genome is assumed to be identical 
for all individuals and is kept constant. In the first part of our study we consider the regulatory sequence comprised 
of a single TFBS (i.e. L = n). Later, we consider the evolution of a longer sequence (i.e. L 3> n) in which more 
than one TFBS can evolve. For simulations, we use a Wright-Fisher model where N diploid individuals are sampled 
from the previous generation after mutation and selection. Our analytical treatment is general and corresponds to 
setups where a diffusion approximation to allele frequency evolution is valid. We neglect recombination since typical 
regulatory sequences are short, L < 1000. To be consistent with most of the population genetics literature we assume 
diploidy, but since we do not consider any dominance effects, our results also hold for a haploid population with 2N 
individuals. 

Evolutionary dynamics simplify in the low mutation limit where the population consists of a single genotype during 
most of its evolutionary history (the fixed state population model). Desai & Fisher |SS] have shown that the condition 
log^jVA/ ^ 4 jvJ,,A/ needs to hold for a fixed state population assumption to be accurate. The term on the left is 
the establishment time of a mutant allele with a selective advantage A/ relative to the wild type; the term on the 
right-hand side is the waiting time for such an allele to appear, where C4 is the beneficial mutation rate per individual 
per generation. Note that, in binding site context, C4 refers to the rate of mutations which increase the fitness, for 
instance, by increasing binding strength. Its exact value depends on the current state of the genotype; nevertheless, 
typical value estimates help model the evolutionary dynamics. In multicellular eukaryotes, where most evidence for 
the evolution of TFBSs has been collected and which provide the motivation for this manuscript, the number of 
mutations per nucleotide site is typically low, e.g. ANu ^ 0.01 in Drosophila and ANu ~ 0.001 in humans [56], where 
u is the point mutation rate per generation per base pair. For a single binding site of typical length n ~ 5 — 15, one 
therefore expects the fixed state population model to be accurate. For longer regulatory sequences, one expects that 
beneficial mutations are rare among all possible mutations, so that the fixed state population model can be assumed 
to hold as well. 

Evolution under the fixed state assumption can be treated as a simple Markovian jump process. The transition 
rate from a regulatory sequence cr to another regulatory sequence cr' in a diploid population is 


= 2N Pfix(A, A/^y,,) 


( 1 ) 


where — /(f) is the fitness difference and LG'^cr is the mutation rate from cr to cr' . The fixation 

probability of a mutation with fitness difference A/ in a diploid population of N individuals is 


PUN, A/) 


1 - 
1 - 


2A/ 

1 _ g-4NAf > 


( 2 ) 


which is based on the diffusion approximation m- Note that the fixation probability scaled with 1/iV approximates 
to 2NAf when NAf ^ 1. Evolutionary dynamics therefore depend essentially on how regulatory sequences are 
mutationally connected in genotype space, and how fitnesses differ between neighboring genotypes, i.e., on the fitness 
landscape. 
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FIG. 1: Biophysics of transcription regulation. A) TFs bind to regulatory DNA regions (promoters and enhancers) 
in a sequence-specific manner to regnlate transcriptional gene expression (mRNA production) level via different mechanisms, 
such as recruiting RNA polymerase (RNA-pol). B) A schematic of two types of mutational processes that we model: point 
mutations (left) and indel mutations (right). C) The mismatch binding model results in redundancy of genotype classes, with 
a binomial distribution (red) of genotypes in each mismatch class (some examples of degenerate sequences shown) D) The 
mapping from the TFBS regulatory sequence to gene expression level is determined by the thermodynamic occupancy (binding 
probability) of the binding site. If each of the k mismatches from the consensus sequence decreases the binding energy by e, the 
occupancy of the binding site is nToik) = + , where ^ is the chemical potential (related to free TF concentration). 

A typical occupancy curve is shown in black (e = 2kBT and jj. = AksT)-, the gray curves show the effect of perturbation to 
these parameters (e = IksT, e = SksT and jj, = 6fcsT); the orange curve illustrates the case of two cooperatively binding 
TFs (fee = 0 and Ec = —SksT, see text for details). We pick two thresholds, shown in dashed lines, to define discrete binding 
classes: strong 5 (tttd > 2/3) and weak W (tttd < 1/3). 
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Directional selection on biophysically motivated fitness landscapes 

In this study, we focus on directional selection by assuming that fitness / is proportional to gene expression level g 
which depends on regulatory sequence, i.e. 


f{(T) = sg{cT) (3) 

where s is the selection strength. It is important to note that this choice does not imply that directional selection is 
the only natural selection mechanism. It simply aims at obtaining the theoretical upper limits for the rates of gaining 
and losing binding sites. 

To analyze a realistic but tractable mapping from the regulatory sequence to fitness, we primarily assume that the 
proxy for gene expression is the binding occupancy (binding probability) tt at a single TF binding site, or the sum 
of the binding occupancies within an enhancer/promoter region (based on limited experimental support [84] 1. This 
corresponds to 


i 


( 4 ) 


where is the binding occupancy of a site starting at the nucleotide i in sequence cr, and s can be interpreted 
as the selective advantage of a strongest binding to a weakest binding at a site. We assume all binding sites have 
equal strength and direction in their contribution towards total gene activation. Sites acting as repressors in our 
simple model would enter into Eq Q with a negative selection strength, s. Future studies developing mathematically 
tractable models should consider more realistic case of unequal contribution with combined activator and repressor 
sites responding differentially to various regulatory inputs [53]. Although one can postulate different scenarios that 
map TF occupancies in a long (L ^ n) promoter to gene expression, we chose the simplest case which allows us to 
make analytical calculations. Later we relax our assumption on noninteracting binding sites and consider the effects 
of several kinds of interactions on gene expression and thus on evolutionary dynamics. 

The occupancy of the TF on its binding site is assumed to be in thermodynamic (TD) equilibrium [5^I551 - H5] . While 
this might not always be realistic jSS] |Sn| , there is empirical support for this assumption (particularly in prokaryotes) 
[SleolllII, and more importantly, it is sufficient to capture the essential nonlinearity in this genotype-phenotype- 
fitness mapping |62] . In thermodynamic equilibrium, the binding occupancy at the site starting with the f-th position 
in regulatory sequence is given by 

+ ( 5 ) 

Here, g is the chemical potential of the TF (related to its free concentration) |44l|64|; Ei is the sequence specific 
binding energy, where lower energy corresponds to tighter binding, and /3 = (A:bT)“^. We compute the binding energy 
Ei by adopting an additive energy model which is considered to be valid at least up to a few mismatches from the 
consensus sequence [SHIMlISaiSS], i e. 


i-\-n— 1 

^ CcT,J (6) 

j-i 

where ^ stands for the energy matrix whose ,j element gives the energetic contribution of the nucleotide aj appearing 
at the j-th position within TFBS. With this, Eq (|^ can be rewritten more formally as 

fif^)=sJ2n^l,{E,{cr)) ( 7 ) 

i 

To allow analytical progress, we make the “mismatch assumption,” i.e., the energy matrices contain identical e > 0 
entries for every non-consensus (mismatch) base pair; the consensus entries are set to zero by convention. A single 
binding sequence with k mismatches therefore has the binding energy E = ke. We will refer to e as “specificity.” 
Specificity is provided by diverse interactions between DNA and TF, including specific hydrogen bonds, van der Waals 
forces, steric exclusions, unpaired polar atoms, etc. [63]. e is expected to be in the range 1 — 3 ksT, which is consistent 
with theoretical arguments [H] as well as direct measurements [^51 - 157] . Note that we explicitly check the validity 
of the analytical results based on the mismatch assumption by comparing them against simulations using realistic 
energy matrices. The redundancy (i.e., normalized number of distinct sequences) of a mismatch class fc at a single site 


6 


in a random genome can be described by a binomial distribution (p (Fig. [^) where the probability of encountering 
a mismatch class k is 

a) = - a)" ^ (8) 

where a = 3/4 in the case of equiprobable distribution over the four nucleotides. 

We focus on selection in a single environment, which in this framework corresponds to a single choice for the TF 
concentration. We therefore fix the chemical potential to a baseline value of ^ = 4 ksT, which maps changes in the 
sequence (mismatch class k) to a full range of gene expression levels, as shown in Fig. EP- We subsequently vary /i 
systematically and report how its value affects the results. 

After these preliminaries, the equilibrium binding probability of Eq ([^ reduces to 

7rTD(fc) = + (9) 

This function has a sigmoid shape whose steepness depends on specificity e and whose midpoint depends on the 
ratio of chemical potential to specificity, fi/e (Fig. [^). To simplify discussion, we introduce two classes of sequences: 
genotypes are associated with “strong binding” S and “weak binding” W if tttd >2/3 and tttd <1/3, respectively. 
The thresholds that we pick are arbitrary, while still placing the midpoint of the sigmoid between the two classes; 
our results do not change qualitatively for other choices of thresholds. In the mismatch approximation, the genotype 
classes k = {0, 1, ..., G S and k = {few, few + !> n} G W correspond to strong and weak binding, respectively. 
ks and fcyy are defined as the closest integers to the thresholds defined above; these values depend on e and /r. We 
also define a “presite” as the mismatch class that is 1 bp away from the threshold for strong binding, i.e., a class with 

+ 1 mismatches. Note that binding length n extends the tail of the fitness landscape for a single site and shifts the 
center of redundancy rich mismatch classes (Fig. HP)- 

The formulation in Eq Q reduces to 

/(A:) = s tttd (A:) (10) 

in a mismatch approximation at a single site, which we will investigate extensively for Ns scaling of TFBS gain and 
loss rates. We consider a wide range of Ns values: Ns < 0 for negative selection. Ns = 0 for neutral evolution. 
Ns ^ 1 for weak positive selection. Ns >> nlog(2)/2 for strong positive selection (see below for this particular choice 
of the threshold). 

In order to study the effects of interacting TFBSs in large regulatory sequences, we relax our assumption of non¬ 
interacting TFBS in Eq 0 and study three simple models. In the main text, we report the cooperative physical 
interaction between two TE molecules binding two nearby sites where the binding probability at a site is modified as 

g-/3(efc-/j) _j_ g-l3(e(k+ka)-2fi-Ec) 

^C00P(A', ^c) ^ ^ 

where k^ stands for the mismatch class at the co-binding site and Ec for cooperativity. In this study we consider 
that cooperative energy ranges from an intermediate strength {Ec = —2A:bT) to a high strength {E^ = —4 ksT) m- 
Eig[^ shows an example of the binding probability when a strong co-binding site exists. As a function of k alone, at 
fixed kc, this formulation of cooperativity is consistent with the zero-cooperativity {Ec = 0) case but with a changed 
effective chemical potential. We take cooperative interactions into account if the two TEs are binding within 3 bp of 
each other, and we only consider the strongest binding of the cooperative partner (i.e., the proximal location with the 
lowest kc). 

In Supporting Information, we discuss the other two models of interacting TFBS. In one model, gene expression 
is determined only by the binding probability of the strongest site in the regulatory sequence. In the other model, 
gene expression is determined by the probability of the joint occupancy of 2 strongest binding sites, anywhere in 
the regulatory sequence; this model is a toy version of synergistic “non-physical” interaction of TFs which compete 
with nucleosomal binding for the occupancy of regulatory regions in eukaryotes (see Mirny (2010) [68] for a detailed 
model). 


Point and indel mutations 

Point mutations and indels are the only mutational processes in our framework. Point mutations with a rate u 
convert the nucleotide at one position into one of the 3 other nucleotide types. For a single binding site, the probability 
that a point mutation changes the mismatch class from k to k' is 
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“ ^/”-) ^fe'.fc+i + (fc/3n) 4',fc-i + {2k/3n) Sk',k (12) 

where 4,& = 1 if a = 6 and 0 otherwise. 

We define the indel mutation rate per base pair such that it occurs with rate at a position where a random 
nucleotide sequence is either inserted, or an existing nucleotide sequence is deleted. For mathematical simplicity, 
we assume that insertions and deletions are equally likely; in fact, a slight bias towards deletions is reported in the 
literature with a ratio of deletion to insertion ~ 1.1 — 3.0 [69ll71) . Parameter 9 is the ratio of indel mutation rate 
to point mutation rate, and is reported to be in the range 0.1 — 0.2 [znm. We consider two cases: the baseline 
of d = 0 for no indel mutations, and 9 = 0.15 for the combined effect of indel and point mutations. Since we fix 
the length of the regulatory sequence, indels shift existing positions away from or inwards to some reference position 
(e.g., transcription start site). For consistency, we fix the regulatory sequence at its final position and assume that 
sequences before the initial position are random. Indel lengths vary, with reports suggesting a sharply decreasing 
but fat-tail frequency distribution m- For simulations we consider only very short indels of size 1 — 2 bp, occurring 
proportional with their reported frequencies of 0.45 and 0.18, respectively. We do not need to assume any particular 
indel length for analytical calculations (below). While sufficient for our purposes, this setup would need to be modified 
when working with real sequence alignments of orthologous regions. 

For a single binding site (i.e. L = n) one can exactly calculate the probability of an indel mutation changing the 
mismatch class from k to k' as 

n k' 

-Pfcyf =x\k) p{Y, = k' - a:). (13) 

2—1 a ;—0 

Here, i is the index for the position of an indel mutation within the binding site. The distribution over possible 
positions is uniform (hence 1/n). The indel mutation defines two distinct parts in the binding site in terms of 
mismatches: nucleotides behind the indel mutation preserve their mismatch information, yet the nucleotides within 
and after indel mutation completely lose it. The new mismatches at these distinct parts Xi and 4 are binomial 
random variables. 


p{X^ = x\k) = cf)^{i - 1, a = k/n) , . 

p(Xi = y) = </>y(n - *-k 1 , a = 3/4) ^ 

where cj)k{n, a) is defined in Eq Figshows that Monte Carlo sampling of indel mutations at a single binding 
site matches the analytical expression in Eq Q. 

The two types of mutations can be combined into the mutation rate matrix as follows: 


Uk',k = 


n 1! ( _1_ 0 

~ J2k'^k ^k',k 


) 


k'^k 
k' = k 


(15) 


Evolutionary dynamics of single TF binding sites 


For a sequence that consists of an isolated TFBS (i.e., L = n), analytical treatment is possible under the fixed state 
assumption. Let ^p{t) be a distribution over an ensemble of populations, whose k-th component, denotes the 

probability of detecting a genotype with k mismatches at time t. In the continuous time limit, the evolution of 
is described by 


which accepts the following solution: 


= R 

(16) 

V>(t) = e«‘-V’(0). 

(17) 


Here, R is the transition rate matrix defined as 


2N Uk',k PaAN, Mk',k) k'^k 
~J2k'^k^k',k k' = k 


( 18 ) 
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This dynamical system is a continous-time Markov chain and there exists a unique stationary distribution i/j 
corresponding the genotype distribution over an ensemble of populations at large time points. It can be calculated 
by decomposing the transition rate matrix R into its eigenvalues and eigenvectors. The normalised left eigenvector 
with zero eigenvalue corresponds to the stationary distribution. This can also be expressed analytically as 




(19) 


where F{k,N) = ANf{k) captures the relative importance of selection to genetic drift, and H{k \ n,a) is the 
mutational entropy, describing how a particular mismatch class k is favored due to redundancy and connectivity of 
the genotype space. For point mutations alone (0 = 0), H = log(/)fc(n,a), with the binomial distribution 0fe(n, a) 
as defined in Eq ([^. Obtaining a closed form expression for H is difficult when considering indel mutations (0 > 0), 
yet the eigenvalue method solution suggests a similar shape for 0 in the range of interest. The form of the stationary 
distribution was known for a long time in population genetics literature for a single locus or many loci with linkage 
equilibrium m- It has recently been generalised to arbitrary sequence space under the fixed state assumption [321177], 
resulting in the form of Eq (191 with a close analogy in the energy-entropy balance of statistical physics (HD) , and 
become a subject of theoretical interest isiiisiiTniisii- 

Under weak directional selection for high expression (and thus high binding site occupancy), the stationary dis¬ 
tribution shows a bimodal shape, with one peak located around the fittest class, fc ~ 0, and another at the core of 
mutational entropy, k ^ an (recall that a = 3/4 for a completely random genome). This bimodal shape collapses 
to a unimodal one, either at no selection or at strong selection. The threshold value for Ns distinguishing strong 
and weak selection regimes primarily depends on the TFBS binding length, n. In a sigmoidal fitness landscape and 
approximating the binomial distribution by a normal distribution as appropriate, the sizes of these two peaks are 
roughly proportional to exp (47Vs — nlog4) and \j2Tia{l — a)n, respectively. Therefore, we expect the threshold Ns 

to scale as log 4 — ^ log27rQ!(l — a)n'^. For typical n, the linear term is dominant, suggesting that 


iVs ~ nlog(2)/2 (20) 

corresponds to the threshold for strong selection in TFBS evolution (cf. Fig0. Note that this n scaling differs from 
the log(n) scaling which is expected in simple fitness landscapes [52]. Our argument assumes that the system is at 
evolutionary equilibrium, which, as we will see, is not necessarily the case even under strong selection, providing 
further motivation for focusing on dynamical aspects of evolution. 

We define the time needed to gain (or lose) a TFBS as the time it takes for a strong binding site to emerge from a 
weak one (and vice versa), as schematized in Fig. HP- For an isolated TFBS, these times can be computed from the 
Markovian properties of the evolutionary dynamics, by calculating the average first hitting times [83) . We will use 
the notations and (t)w<_fe, respectively, for average gain and loss times when evolution starts from mismatch 

class k. Obviously, {t)s^k = 0 if fc is among the strong binding classes (fc € S) and (t)w^fc = 0 if fc is among the 
weak binding classes {k € W). The average gain times from other mismatch classes can be found by considering the 
relation {t)s^k = 1 + J2k'^s Pk,k' {t)s^k', where Pk,k' is the probability of transition from k' to k in one generation. 
One can compute the average gain times by writing it in terms of linear algebraic equation: 

Ts^ = • (- 1 ) ( 21 ) 

where T 5 <_ is a column vector listing non-trivial gain times, i.e. {(t)s<_fc} for A: = -|- 1, ..., n. is the R matrix 

with all rows and columns corresponding to k € S deleted and —T is the matrix operator for the transpose after an 
inverse operation. 1 is a vector of ones. Similarly one can find the loss times, 

Tw^ = (R^w)”^ • (-1) (22) 

where Ty^;^ is a column vector listing non-trivial loss times, i.e. {(t)w<_fe} for k = 1, 2, ... kw — 1. R^w is the R 
matrix with all rows and columns corresponding to k G W deleted. 

In the case of point mutations alone (0 = 0), the R matrix is tri-diagonal and one can deduce simpler formulae for 
gain and loss times: 


/,\ (point) 

W5-(-fe 


E k 1 1 — 

+ l Ri-i, i {fj. 


/j.\ (point) _ v^fcvv 1 ^i-i 

K'^IW^k — l^i=k+l 


( 23 ) 
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where we use denote the cumulative stationary distribution. For very strong selection, the second 

term in the sums approaches unity, resulting in even simpler formulae [32], called the “shortest path” (sp) solution: 




E 

E 


k 

i—ks + l 
few 

i—k-\-l 




(24) 


These equations can be used to quickly estimate gain and loss rates of interest. For example, the gain rate from 
presites under strong selection is approximately 2 ifiks) ~ +1)). Although the exact value depends on 

the binding specificity and chemical potential, one can see that it is about Nsu for the parameter range of interest. 
Similarly, one can see that the rate of loss from strong sites is about 2n | A^s| u when there is strong negative selection. 


RESULTS 

Single TF binding site gain and loss rates under mutation-selection-drift are typically slow 


We first studied the evolutionary rates for a single TF binding site at an isolated DNA sequence of the same length 
under mutation, genetic drift, and directional selection for high gene expression level (i.e., tighter binding). As detailed 
in the Models & Methods section, we combined a thermodynamically motivated fitness landscape with the mismatch 
approximation, and assumed that the mutation rate is low enough for the fixed state population approximation to be 
valid. Under these assumptions, we could calculate the inverse of the average TFBS gain and loss times as a function 
of the starting genotype, using either an exact method or Wright-Fisher simulations. We considered point mutations 
alone, or point mutations combined with short indel mutations, in order to understand under which conditions the 
rates of gaining and losing binding sites can reach or exceed the rates 2 — 3 orders of magnitude greater than point 
mutation rate, and thus to become comparable to rates observed in comparative genomic studies. 

Fig.[§^ shows the dependence of the TFBS gain rate on the selection strength (with respect to genetic drift). Ns. 
For parameters typical of eukaryotic binding sites (length n = 7 bp, specificity e = 2 A:bT), the TFBS gain rates 
are extremely slow (practically no evolution) when there is negligible selection pressure (Ns ^ 0), indicating the 
importance of selection for TFBS emergence. Indeed, the effective selection needs to be very strong, e.g., Ns > 100, 
for TFBS evolution to exceed the per-nucleotide mutation rate by orders of magnitude and become comparable to 
speciation rates. 

Even if strong selection were present, the gain rate depends crucially on the initial genotype. While gain rates from 
presites, i.e., genotypes one mutation away from the threshold for strong binding, are roughly Nsu for the strong Ns 
regime (as estimated by Berg et al. |32]1. they decrease dramatically if more mutational steps are needed to evolve 
a functionally strong binding site. This is illustrated in the inset to Fig [2]4, showing an exponential-like decay in 
the gain rates as a function of the number of mismatches, even for a TFBS of a modest length of 7 bp. As argued 


in the Models & Methods section (see Eq (20)), we confirmed that the threshold for the strong Ns regime scales as 
nlog(2)/2 and not as log(n) which is the case for simple fitness landscapes [82] . 

The availability of a realistic fraction of indel mutations (here, 9 = 0.15) can speed up evolution when starting from 
distant genotypes (cf. solid and dashed red line in Eig[^). This is because indels connect the genotype space such 
that paths from many to few mismatches are possible within a single mutational step. Nevertheless, the improvement 
due to indel mutations does not alleviate the need for very strong selective pressure and the proximity of the initial 
to strongly-binding sequence, in order to evolve a functional site. 

Biophysical parameters—the binding site length n, the chemical potential and the specificity e—influence the 
shape of the fitness landscape and thus the TFBS gain rates. This is especially evident when we consider de novo 
evolution starting from random sequence. As shown in Figs |^, C, increases in specificity or length cause a sharp 
drop in the gain rates from initial sequences in the most redundancy rich class, which can be only partially mitigated 
by the availability of indel mutations. This especially suggests that adaptation of TFBS from random sequences for 
TF with very large binding lengths and very strong specificities is unlikely with point and indel mutations which can 
constrain the evolution of TF lengths and TF specificity, which is consistent with Berg et al. (2004) j^’s earlier 
numeric observation. Importantly, the binding specificity and length show an inverse relation with the logarithm of 
the gain rates. This is due to the fact that a decrease in specificity allows more genotypes to generate appreciable 
binding and therefore fitness (see Fig. which partially compensates the increase in mutational entropy at larger 
binding site lengths. Variation of the chemical potential fj, corresponding to an order-of-magnitude change in the free 
TF concentration does not qualitatively affect the results. 
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FIG. 2: Single TF binding site gain rates at an isolated DNA region. A) The dependence of the gain rate, l/{t)s^k 
shown in units of point mutation rate, from sequences in different initial mismatch classes k (blue: k = 2, red: k = 5), as a 
function of selection strength. Results with point mutations only {9 — 0) are shown by dashed line; with admixture of indel 
mutations {9 — 0.15) by a solid line. For strong selection. Ns ^ nlog(2)/2, the rates scale with Ns, which is captured well by 
the “shortest path” approximation (black dashed lines in the main figure) of Eq ( |24[ |. The biophysical parameters are: site length 
n = 7 bp; binding specificity e = 2 ksT; chemical potential /r = 4 ksT. Points correspond to Wright-Fisher simulations with 
Nu = 0.01 where error bars cover ±2 SEM (standard error of mean). Inset shows the behavior of the gain rates as a function 
of the initial mismatch class k for Ns = 0 and Ns = 100. B, C) Gain rates from redundancy rich classes (k ~ ZnjA:, typical of 
evolution from random “virgin” sequence) under strong selection, without (B) and with (C) indel mutations supplementing the 
point mutations. Red crosshairs denote the cases depicted in panel A. Gontour lines show constant gain rates in units of Nsu 
as a function of biophysical parameters n and e. Wiggles in the contour lines are not a numerical artefact but a consequence 
of discrete mismatch classes. 


Typically slow TFBS evolution is a consequence of the sigmoidal shape of the thermodynamically motivated fitness 
landscape, where adaptive evolution in the redundant but weakly binding classes W must proceed very slowly due 
to the absence of a selection gradient. To illustrate this point, we generated alternative fitness landscapes that agree 
exactly with the thermodynamically motivated one from the fittest class to the threshold class for strong binding, 
ks, but after that decay as power laws, TTpi, with a tunable exponent (see SI text). As seen in Fig|^ this exponent 
is a major determinant of the gain rates, suggesting that a biophysically realistic fitness landscape is crucial for the 
quantitative understanding of TFBS evolution. 

To check that the assumption of the fixed state population is valid at Nu = 0.01, the value used here that is 
also relevant for multicellular eukaryotes [S^, we performed Wright-Fisher simulations as described in the Models & 
Methods section. Fig[^ shows excellent agreement between the analytical results and the simulation. We further 
increased the mutation rate to Nu = 0.1, a regime more relevant for prokaryotes where polymorphisms in the 
population are no longer negligible, to find that the analytical fixed state assumption systematically overestimates 
the gain rates, as shown in Fig In the presence of polymorphism, therefore, evolution at best proceeds as quickly 
as in monomorphic populations, and generally proceeds slower, so that our results provide a theoretical bound on 
the speed of adaptive evolution under directional selection. This is expected since the effects of clonal interference 
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kick in after a certain Nu^ where two different beneficial mutants start competing with each other, and eventually 
decrease the fixation probability in comparison to one beneficial mutant sweeping to fixation as in the monomorphic 
population case. 

To check that the mismatch assumption does not strongly affect the reported results, we analyzed evolutionary 
dynamics with more realistic models of TF-DNA interaction. Different positions within the binding site can have 
different specificities, and one could suspect that this can significantly lower the evolutionary times. First, some 
positions within the TFBS may show almost no specificity for any nucleotide, most likely due to the geometry of 
TF-DNA interactions (e.g, when the TF can contact the nucleic acid residues only in the major groove); we have 
not simulated such cases explicitly, but simply take the binding site length n to be the effective sequence length 
where TF does make specific contacts with the DNA. Second, the positions that do exhibit specificity might do so 
in a manner that is more inhomogeneous than our mismatch assumption, which assigns zero energy to the consensus 
and a constant e to any possible mismatch. We thus generated energy matrices where e was drawn from a Gaussian 
distribution with the same mean (e) = 2 ksT as in our baseline case of Fig[2]4, but with a standard deviation 0.5 ksT. 
Figj^shows that both equal and unequal energy contributions produce statistically similar behaviors, indicating that 
inhomogeneous binding interactions cannot substantially enhance the evolutionary rates. 

We further investigated the rate of TFBS loss (Fig[^. Here too strong (negative) selection is needed to lose a 
site on reasonable timescales, and it is highly unlikely that a site would be lost in the presence of positive selection. 
In contrast to the TFBS gain case, however, negative selection and mutational entropy act in the same direction for 
TFBS loss, reducing the importance of the initial genotype and making selection more effective at larger n and e. 

Taken together, these results suggest that the emergence of an isolated TFBS under weak or no selection is typically 
slow relative to the species’ divergence times, and gets rapidly slower for sites that are either longer or whose TFs 
are more specific than the baseline case considered here. This suggests that biophysical parameters themselves may 
be under evolutionary constraints; in particular, if point mutations and indels were the only mutational mechanisms, 
the evolution of long sites, e.g. n ^ 10 — 12, would seem extremely unlikely, as has been pointed out previously [32]. 
Absent any mechanisms that could lead to faster evolution and which we consider below, isolated TFBS are generally 
only likely to emerge in the presence of strong directional selection and a favorable distribution of initial sequences 
that is enriched in presites. 


Convergence to the stationary distribution is slow and depends strongly on initial conditions 


A number of previous studies (e.g., [BH [TR] [73]) assumed that a stationary distribution of mismatch classes is 
reached in the evolution of isolated TFBS and thus an equilibrium solution, Eq (19), is informative for binding 
sequence distributions. In contrast, our results for average gain and loss times suggest that the evolution of an 
isolated TFBS is typically slow. To analyze this problem in a way that does not depend on arbitrary thresholds 
defining “strong” and “weak” binding classes S and W, we first examined the evolution of the distribution ip{k) over 
the mismatch classes as a function of time in Fig[^. For typical parameter values it takes on the order of the inverse 
point mutation rate to reach the stationary distribution for populations that start off far away from it, even with 
strong selection. 

A systematic study of the convergence rates can be performed by computing the (absolute value of the) second 
eigenvalue, IA 2 I, of the transition rate matrix R from Eq (18), and exploring how this depends on the biophysical 
parameters n and e. Consistent with previous results, we observe large increases in convergence times as n and e 
increase. Eor example, an increase in the binding site length from n = 7 to n = 11 at baseline specificity of e = 2 ksT 
would result in a ten-fold increase in the convergence time. 

The intuitive reason behind the slow convergence rates is in the bimodal nature of the distribution '0(fc) on the 
thermodynamically motivated fitness landscape, similar to that reported by Lynch & Hagner |3]. One “attractor” is 
located around the fittest class {k ^ 0, due to directional selection), while the other is located around the redundancy- 
rich mismatch classes {k ^ 3/4n). These two attractors are separated by a typically sharp fitness landscape, and the 
redundancy-rich attractor lacks selection gradients needed to support fast adaptation. The temporal evolution of the 
distribution xp{k) from, e.g., a maladapted state, can thus be best understood as the probability weight “switching” 
from resting approximately within one attractor to the other one, while maintaining the bimodal shape throughout, 
rather than a gradual shift of a unimodal distribution from a maladapted initial value of k to the value favored by 
selection. This is especially true when n gets larger: although adaptation within the functional sites can still happen, 
adaptation from the most random mismatch classes becomes extremely slow, even under strong selection (see Eig[^. 

These results suggest that stationary distributions of isolated TFBS sequences may not be realizable on the 
timescales of speciation, which should be a cause of concern when stationarity is assumed without prior critical 
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FIG. 3: Convergence to the stationary distribution of TFBS sequences. A) Evolutionary dynamics of the mismatch 
classes distribution ^f>{k) for an isolated TFBS under point and indel mutations {9 — 0.15), directional selection for stronger 
binding, and genetic drift is shown for initially well {k = 0, blue) and badly (fc = 5, red) adapted populations. At left, no 
selection {Ns = 0); at right, strong selection {Ns = 100). Different curves show the distribution of genotype classes at different 
time points {t = 0u~^, 0.05u“^, 0.1u“^ as decreasing opacity); stationary distribution is shown in green. Insets show the 
time evolution to convergence for initially well {k = 0, blue) and badly {k = 5, red) adapted populations, measured by the 
Kullback-Leibler divergence DKL['4’{t) ll'*/’(f = °o)]. The biophysical parameters are: n = 7 bp, e = 2 ksT, fj, = 4 ksT. B) 
Rate of convergence to the stationary distribution for different e and n values under strong selection {Ns S> nlog(2)/2; here 
specihcally Ns = 100) and for 9 = 0.15. Crosshairs represent the parameters used in a). 


assessment. For example, applications assuming the stationary distribution might wrongly infer selection on regula¬ 
tory DNA. 


Evolution of TF binding sites in longer sequences 


So far we have shown that the evolution of isolated TFBS is typically slow. How do the results change if we consider 
TFBS evolution in a stretch of sequence L bp in length, where L ^ n, e.g., within a promoter or enhancer? Here we 
focus on de novo evolution under strong directional selection for high gene expression, by simulating the process in 
the fixed state population framework. Compared to the isolated TFBS case, we need to make one further assumption: 
that the expression level of the selected gene is proportional to the summed TF occupancy on all sites within the 
regulatory region of length L (see Models & Methods for details). While this is the simplest choice, it is neither unique 
nor perhaps the most biologically plausible one, although limited experimental support exists for such additivity [M] ; 
it does, however, represent a tractable starting point when the interactions between individual TF binding sites are 
not strong and the contribution of each site is equal and of the same sign. To address the interactions, we look at 
the cooperative binding case in the following section. In Supporting Information, we also discuss the competition of 
TFBSs for the strongest binding, and the “nonphysical” synergetic interaction by two strongest TFBSs. 

We propose a simple analytical model for the time evolution of the number of strongly binding sites, z{t), in the 
promoter, derived from isolated TFBS gain and loss rates, Again and Aioss- Assuming constant rates, one can write 


_d 

dt 


z{t) — Again ( '^max 



'^loss'2^(0 


(25) 


where ^max 
can overlap, 


is the maximum number of TFBS that can fit into the regulatory sequence of length L bp. If the sites 
-Zmax = L — n + I, otherwise Zmax ~ L/n. The solution for Eq (25) is 


z 




B 

A 


(26) 


where A = (Again + Aioss), S = ZmaxAgain and Zo = z{t = 0). Under strong positive selection, i.e. Ns » nlog(2)/2, 
the loss rate Aioss can be ignored. If the distribution of the initial mismatch classes in the promoter is ^jJk, one can 
approximate Zmax - -Zo = ^max Ylk=ks+i obtain: 


n 

o ~ (l C ) Zniax ^ 

k=ks + l 


z{t) - Z, 


(27) 
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There are two limiting regimes in which we can examine the behavior of Eq (271. Over a short timescale, evolutionary 
dynamics will search over all possible positions, Zmax = T — n + 1, to pull out the presites, since they are fastest to 
evolve into the strong binding class S, i.e.: 




H ^ks + l/{t)s^ks + l 


(28) 




As the process unfolds and new sites are established, new TFBS will only be able to emerge at a smaller set of 
positions due to possible overlaps, so that Zmax ~ Lju. On the other hand, evolution from higher mismatch classes 
will also start to contribute towards new sites: 

Again « A^L = ( E E (29) 

k^S k^S 

Fig|4] shows how new TFBSs with length n = 7 bp emerge over time in a promoter of L = 30 bp in length. 
Consistent with the predictions of our simplified model, we can distinguish the early, intermediate, and late epochs. 
In the early epoch, t < 0.01m“^, presites are localized among all possible locations and are established as binding 
sites. During this period, the growth in the expected number of new TFBSs is linear with time. The importance and 
predictive power of presites at early epoch remain even under different models of gene expression, including interaction 


between TFBSs (see Fig 14). In the intermediate epoch, new binding sites accumulate at the rate that is slightly 


above that expected by establishment from presites alone, as the mutational neighborhood is explored further. In the 
late epoch, t > 0.1u“^, initial sites in the immediate mutational vicinity have been exhausted, and established sites 
have constrained the number of positions where new sites can evolve from more distant initial sequences, leading to 
the saturation in the number of evolved TFBS. 

Using the simple analytical model, we explored in Fig|^,C how the binding length n and specificity e affect the 
number of newly evolved TFBS. Increasing n leads to a steep decrease in the number of expected sites, with a somewhat 
weaker dependence on e, especially at early times. Simulations at other values of biophysical and evolutionary 
parameters confirm the qualitative agreement between the analytical model and the simulation (Fig |12[ ); given that 
the model is a simple heuristic, it cannot be expected to match the simulations in detail, yet it nevertheless seems to 
capture the gross features of evolutionary dynamics. Together, these results show that at early times under strong 
selection, the number of newly evolved sites will grow linearly with time and proportional to L, before evolution from 
higher mismatch classes can contribute and ultimately before the sites start interacting, with a consequent slowdown 
in their evolution. Thus, evolution in longer regulatory regions {L = 10^ — 10^ bp) could feasibly give rise to tens of 
binding sites at Ns = 10^ —10^ within a realistic time frame t ^ O.OOlri”^, if the sites are sufficiently short (n ~ 7 bp). 
Explaining the evolution of longer sites, e.g., n > 10 — 12 bp, especially within short promoters found in prokaryotes, 
would likely necessitate invoking new mechanisms. 


Ancient sites and cooperativity between TFs can accelerate binding site emergence 


Finally, we briefly examine two mechanisms that can further speed up the evolution of TF binding sites in longer 
sequences. 

The first possibility is that the sequence from which new TFBS evolve is not truly random; as discussed previously, 
presites have a strong influence on the early accumulation of new binding sites. There are a number of mechanisms that 
could bias the initial sequence distribution towards presites: examples include transposable elements, DNA repeats, 
or CG content bias. Here we consider an alternative mechanism that we refer to as the “ancient TFBS scenario,” in 
which a strong TFBS existed in the sequence in the ancient past, after which it decayed into a weak binding site, 
possibly due to the relaxation of selection (i.e.. Ns ^ 0). 

As we demonstrated in the context of isolated sites, TFBS loss rates are slow and the remains of the binding site 
will linger in the sequence for a long time before decaying into the most redundancy rich mismatch classes. This 
biased initial distribution of mismatches in a sequence of length L with a single ancient site can be captured by 
writing: 


^ = 


L — n + 


Y i’it') 


L — n 
L — n + 1 


<t> 


(30) 


where <p is the binomial distribution, Eq (j^, characteristic of the random background, and '0(t^) is the distribution of 
mismatches due to the presence of the ancient site. Time t' refers to the interval in which the isolated ancient TFBS 
has been decaying under relaxed selection, and the corresponding can be solved for using Eq 0- 
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FIG. 4: TF binding site evolution in a longer sequence of L = 30 base pairs. The expected number of newly evolved 
TF binding sites with length n — 7 bp, under strong directional selection {Ns = 100) and both point and indel mutations 
{0 = 0.15). Time is measured in inverse mutation rates; the number of newly evolved sites is scaled to the selection strength 
and the sequence length. 1000 replicate simulations were performed with different initial sequences. Average number of sites 
shown by a solid black line; the gray band shows ±2 SEM (standard error of the mean) envelope. Dashed curves are analytical 
predictions based on single TFBS gain rates at an isolated DNA region, given by Eqs ( 27|28|29 1. Biophysical parameters used: 
e = 2 fcflT, /i = 4 fcsT. Insets: Expected number of newly evolved sites from a random sequence of length L a.t t = 0.001u“^ 
(left) and t = O.lrt”^ (right) for different binding length and specificity values, computed using the analytical predictions. 
Crosshairs denote the values used in the main panel. 


Fig [5]A shows that the ancient site scenario can enhance the number of newly evolved sites by resurrecting the 
ancient site, even after it has decayed for t' = 0.1u“^. Simulation results agree well with the simple analytical model 
using the biased initial sequence distribution of Eq (30). Importantly, such a mechanism is particularly effective for 
longer binding sites of high specificity, indicating that regulatory sequence reuse could be evolutionarily beneficial in 
this biophysical regime (see Fig[I^. 

Fig[^ and Fig also show the emergence of new sites when the ancient site was not a full consensus (preferred) 
sequence but differed from it by a certain number of mismatches. The results qualitatively agree with the case of 
perfect consensus. Importantly, this shows that the applicability of the ancient site scenario extends to cases where 
the ancient site belonged to a different TF (albeit with a preferred sequence similar to the studied TF), which has 
recently been reported to be a frequent phenomenon by Payne & Wagner (2014) [JT], possibly due to evolution of 
TFs by duplication and divergence [SS]. 

The second mechanism that we consider is the physical cooperativity between TFs: when one site is occupied, 
it is favorable for the nearby site to be occupied as well. We extended the thermodynamic model to incorporate 




















15 




FIG. 5: Ancient sites and cooperativity can accelerate the emergence of TF binding sites in longer regulatory 
sequences. A) The expected number of newly evolved TFBS in the presence (red and brown) or absence (black) of an ancient 
site, for binding site length n = 10 bp, and specificity, e = 3 fcsT. In this example, the ancient site was a consensus site {k = 0) 
or two mismatches away from it {k = 2) that evolved under neutrality for t' = 0.1/u prior to starting this simulation. Dashed 
lines show the predictions of a simple analytical model, Eq (30l. The inset shows how the number of newly evolved TFBS at 
t = 0.001/u scales with the mismatch of the ancient site k (plot markers: simulation means; error bars: two standard errors of 
the mean; dashed curve: prediction). B) The expected number of newly evolved TFBS without (black) and with cooperative 
interactions (for different cooperativity strengths, magenta: = —2fcsT, yellow: = —SksT, cyan: E^ = —AksT, see 

Eq (111 in the Models & Methods and text) for binding site length n = 7 bp, and specificity, e = 2 ksT. Both panels use /r = 4 
ksT, strong selection {Ns = 100) and a combination of point and indel mutations {6 = 0.15), acting on a regulatory sequence 
of length L — 30 bp. Thick solid lines show an average over 1000 simulation replicates, shading denotes ±2 SEM. 


cooperativity (see the Models & Methods, Eq (11) and Fig[^). The genotype of a nearby site will then influence 
whether a given site acts as a strongly or weakly binding site. The presence of a cooperative site acts as a local shift 
in the chemical potential, which changes the weak/strong threshold, so that an individually weak site can become 
a strongly binding site. Simulations using cooperative binding presented in Fig[^ illustrate how cooperativity can 
increase the speed of evolution. This is specifically effective for short binding sites of intermediate or low specificity, 
where a cooperative energy contribution can strongly influence the number of sites in the strong binding class (see 
Fig[^. 


DISCUSSION 

In this study, we aimed at a better theoretical understanding of which biophysical and population genetic factors 
influence the fast evolution of TFBSs in gene regulatory DNA, making sequence specific TF binding a plausible mech¬ 
anism for the evolution of gene regulation and for generating phenotypic diversity. Following Berg et al. (2004) [32] . 
we combined a biophysical model for TF binding with a simple population genetic model for the rate of sequence 
evolution. The key assumptions are that binding probability is determined by a thermodynamic equilibrium; that 
fitness depends linearly on binding probability; and that populations are typically homogeneous in genotype, and so 
evolve by substitution of single point and short insertion/deletion (indel) mutations. Remarkably, the biophysical and 
the evolutionary models take the same mathematical form: in the biophysical model, binding probability depends on 
the binding energy, relative to thermal fluctuations, /3F, whilst in the evolutionary model, the chance that a mutation 
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fixes depends on its selective advantage, relative to random sampling drift, Ns. 

For single TFBS evolution, we calculated the average transition time between genotypes, the inverse being a 
measure for the speed of the evolution. Our results indicate that TFBS evolution is typically slow unless selection is 
very strong. It is important to emphasize that gaining a TFBS by point mutations under neutral evolution is very 
unlikely, contrasting with the belief in the current literature (e.g., [S1II31). This is mainly due to Stone & Wray’s 
argument that functional sites could readily be found by a random walk EH; however, their argument assumed that 
individuals follow independent random walks, which grossly overestimates the rate of evolution (see MacArthur & 
Brookfield [29] )• Indeed, fast rates of gaining a single TFBS require not only strong selection but also initial sequences 
in the mutational neighborhood of the functional sites. Especially, “presites,” i.e. sequences I bp away from threshold 
sequences, can be crucial since they can evolve to functional sites by single mutations. Indel mutations can increase 
the rate of gaining a single TFBS from distant sequences, since they connect the genotype space extensively, but their 
effect is limited under realistic indel mutation rates (TiiiTg. Future studies should consider the updates in estimates 
of indel mutation rates, since they are currently not as precise as point mutation rates, although we do not expect 
big qualitative departures from our results. 

Considering the evolution of a single TFBS from random sequence, we showed that biophysical parameters, binding 
length and specificity, are constrained for realistic evolutionary gain rates from the most redundant mismatch classes. 
The rates drop exponentially with binding length, making TF whose binding length exceeds 10 — 12 bp difficult to 
evolve from random sites, at least under the point and indel mutation mechanisms considered here. As a consequence 
of the biophysical fitness landscape, binding specificity and length show an inverse relation for the same magnitude 
of the gain rate from the most redundant mismatch class. Such an inverse relation is observed in position weight 
matrices of TFs collected from different databases for both eukaryotic and prokaryotic organisms, by Stewart & 
Plotkin (2012) jSj. In the same study, they reproduce this observation using a simple model which assumes that 
a trade-off between the selective advantage of binding to target sites, versus the selective disadvantage of binding 
to non-target sequence. Their model assumes a stationary distribution, and that sites are functional if they are 
mismatched at no more than one base. It would be interesting to explore a broader range of models that account 
for the dynamical coevolution between transcription factor binding specificity, its length, and its binding sites [3]. 
One idea can be to combine the evolutionary dynamical constraints (against large binding length and high specificity, 
which we show here) with simple physical constraints of TF dilution in non-target DNA (against short binding length 
and low specificity, again in an inverse relation |44j 1. 

For a single TF binding site, the stationary distribution for the mismatch with the consensus binding sequence 
depends on the binding energy, but also on the sequence entropy - that is, the number of sequences at different 
distances from the consensus. Typically, the distribution is bimodal: either the site is functional, and is maintained 
by selection, or it is non-functional, and evolves almost neutrally. We show that it may take an extremely long 
time for the stationary distribution to be reached. Functional sites are unlikely to be lost if selection is strong (i.e.. 
Ns ^ 1), whilst function is unlikely to evolve from a random sequence by neutral evolution, even if predicted under 
stationarity assumption. Therefore, typical rapid convergence to stationary distribution should be considered with 
caution in theoretical studies. 

We showed that the dynamics of TFBS evolution in longer DNA sequences can be understood from the dynam¬ 
ics of single TFBS. The rate of evolution of new binding sites will be accelerated in proportion to the length of 
the promoter/enhancer sequence in which that can be functional; however, because this increase is linear in pro¬ 
moter/enhancer length, it will have a weaker influence than the exponential effect changes in specificity or length of 
binding site. Especially the earlier dynamics (relevant for speciation timescales) are determined by the availability 
of presite biased sequences. Any process that allowed selection to pick up more distant sequences or that increased 
presite ratio among non-functional sites would accelerate adaptation from “virgin” sequences. 

A key factor for an enrichment in presite ratio may arise through variation in GC content or through simple sequence 
repeats (especially if the preferred sequence has some repetitive or palindromic structure). In this study, we showed 
that it may also arise from ancient sites, i.e. sites that were functional in earlier evolutionary history and decayed 
into nonfunctional classes in evolution. Since loss of function is slow (comparable to the neutral mutation rate once 
selection becomes ineffective), this is plausible for sites that are under intermittent selection, or where there is a shift 
to binding by a new TE with similar preferred sequence gTlIHS]. This effect of the earlier evolution can be especially 
important for long binding TFs as convergence to a truly randomized sequence distribution requires much longer 
times. MacArthur and Brookfield [29j showed that real promoter sequences may acquire functional sites more quickly 
than random sequence, but it is not clear whether that is due to a different general composition, or to the ghosts of 
previous selection. New studies are required to test our enriched presite-biased sequence hypothesis, especially for 
orthologous regions where functional TFBS is observed in sister populations or species. In a recent study, Villar et 
al. (2015) [54] provide evidence that enhancer DNA sequence structure is older than other DNA portions, suggesting 
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the reuse of such regions in evolution, plausibly by gaining and losing TFBSs in repetitive manner. Nourmohammad 
& Lassig (2011) [30] showed evidence suggesting that local duplication of sequences followed by point mutations 
played important role in binding site evolution in Drosophila species (but surprisingly, not in yeast species). Another 
interesting option would be the existence of “mobile” presites or their fragments, e.g., as sequences embedded into 
transposable elements that could be inserted before the gene under selection for high expression |5S|. Presites can 
be considered as concrete examples of cryptic sequences [55], potential source of future diversity and evolvability. 
We believe that understanding the effects of presites would contribute to the predictability of genetic adaptations 
regarding gene regulation, especially in important medical applications such as antibiotic resistance or virus evolution. 

We also showed that the evolution of a functional binding site in longer DNA can be accelerated by cooperativity 
between adjacent transcription factors. When a TF occupies a co-binding site, sufficient transcriptional activity can 
be achieved from sequences of larger mismatch classes, an effect similar to a local increase in TF concentration. 
This mechanism permits faster evolution towards strongly binding sequences, and seems most effective for short 
TFBS where it creates a selection gradient already in the redundancy rich mismatch classes. Cooperative physical 
interactions might allow the evolution of binding occupancy and thus expression without large underlying sequence 
changes, which might be a reason for the observed weak correlation between sequence and binding evolution at certain 
regulatory regions. Importantly, TFBS clustering in eukaryotic enhancers can be a consequence of the fast evolution 
with cooperativity, as also supported by a recent empirical study m- 

Our theoretical framework is relevant more broadly for understanding the evolution of gene regulatory architecture. 
Since the speed of TFBS evolution from random sequences is proportional to NsL, our results suggest that population 
size N and the length of regulatory sequences L can compensate for each other in terms of the rate of adaptation. This 
is exactly what is observed: eukaryotes typically have longer regulatory DNA regions but small population sizes, while 
prokaryotes evolve TFBS within shorter regulatory sequence fragments but have large population sizes. Similarly, 
prokaryotes might have achieved longer TF binding lengths n, as large population size allowed them to overcome 
the exponential decrease in the gain rates with increasing n. If relevant, these observations would suggest that an 
important innovation in eukaryotic gene regulation must have been the ability of the transcriptional machinery to 
integrate the simultaneous occupancy of many low-specificity transcription factors bound over hundreds of basepairs 
of regulatory sequence, a process for which we currently have no good biophysical model. 
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SUPPORTING INFORMATION 


Other fitness models for comparison &: for interacting TFBSs 


Power-law decaying fitness models for comparison: 


In order to understand the importance of the thermodynamically-motivated sigmoid shape for the binding prob¬ 
ability, we compare our results to those obtained with power-law functions that decay with exponent 7 (note that 
7 = 00 corresponds to a step-like fitness landscape), formally defined as 

T^Tnik) k<ks , , 

{ks/ky TTTuiks) k> ks 

Fig shows that the power-law exponent is a major determinant of the gain rates, suggesting that a biophysically 
realistic fitness landscape is crucial for the quantitative understanding of TFBS evolution. 



Fitness models of interacting TFBSs in larger regulatory sequence: 

In addition to physical cooperativity between nearby TFs on promoter/enhancers (see the Models & Methods, Fig[^ 
and Fig[T§, here we also consider two other models. The first additional model assumes that the binding occupancy 
of the strongest binding site in the regulatory sequence is the proxy for the gene expression level and the fitness, i.e. 

/(cr) = sMAX{ 7 r«(cr)}. (32) 

Note that different TFBSs interact with each other to compete for the strongest binding within a promoter or an 
enhancer. 

The second additional model addresses synergistic interaction between the two strongest-binding TFBS, located 
anywhere in the regulatory sequence. This example is a simplified version of a biophysical model where TFs, binding 
anywhere in a regulatory region, compete for the occupancy of that region with a nucleosome (for a more elaborative 
modeling framework, see Mirny (2010) [SS]). We call this type of interaction between two TFs “non-physical” because 
TFs don’t interact directly; their interaction is effectively mediated by some other biophysical process. The probability 
of the joint occupancy of the two TFs at promoter or enhancer can be used as the proxy for gene expression level and 
the fitness, i.e. 


f{(^) = 


3-/3(e(fei+fc2)-2/i) 


1 -(- e-P{cki-n) -(- g-P{i:k2-n) g,-P(c{ki+k2)-2n)' 


(33) 


where ki and k 2 correspond to the genotypes of two TFBSs with the smallest mismatches in the regulatory sequence. 

Do these models yield different result for the emergence of strong binding sites from random sequences at early 
evolutionary times (~ speciation time scales), in comparison to our main model, where the sum of binding occupancies 
is used as a proxy for gene expression level [EqQ in the main text]? For typical biophysical parameters (binding 
lenght: n = 7 bp, binding specificity: e = 2 and chemical potential: /i = 4 ksT), we show in Fig 14 that these 
modified models do not differ extensively from results of our main model. 
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FIG. 6: Indel mutations connect the mismatch genotype space differently from point mutations, a) Probability 
that a binding site with k mismatches mutates to k' mismatches, for a single binding site of length n = 7 bp, according to our 
indel mutation mode l in a fixed genomic window (see the Models & Methods section). Dashed curve = analytical prediction 
according to Eq (13 1 . Red points = mean ±1 std of 10® replicate realizations of the frequency distribution (for each replicate, 
1 consensus sequence is created and 10^ mutations are simulated for each k). b) The same analysis as in a), but allowing for 
a flexible genomic window for alignment after insertion mutations. We pick the minimal mismatch case to asses the quality of 
our approximation. As expected, this creates a bias towards smaller mismatch classes, but suggests that our approximation is 
still reasonable. 
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FIG. 7: Threshold value of Ns for bimodality (i.e., threshold between strong and weak selection regimes). 

The value of Ns at which 5% of the probability weight in the stationary distribution is in non-strong mismatch classes, i.e. 
k > ks. For selection stronger than this threshold, the stationary distribution is concentrated at low k (high fitness) classes 
and is practically unimodal. Different colors correspond to differen t bi ophysical parameters (see legend), analytical prediction 
nlog(2)/2 is in black (see the Models & Methods section and Eq (20l). Insets show examples of stationary distributions for 
different Ns values for short and long binding sites. 
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FIG. 8: Single TFBS gain rates in modified fitness landscapes with a power-law tail. The thermodynamic fitness 
landscape has been modified to have a power-law decaying tail of exponent 7 for fc > fcg, as in Eq ( |31[ ) in SI text. We tested 
7 = 1, 2 and 00 corresponding to smooth, intermediate and step-like decay. Plot conventions are the same as in Figl^. b) 
Isolated TFBS gain rate from the most redundant mismatch class for the thermodynamic model, replotted from Fig]^ for 
reference, c) Plots analogous to b) using modified fitness landscapes defined by the power-law exponent 7 . Gain rates are 
higher for small 7 = 1 and lower for the step landscape (7 = 00 ), relative to the reference. 
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FIG. 9: The effect of polymorphisms on the single TFBS gain rate at higher mntation rates. Wright-Fisher 
simulation results (point markers, error bars = 2 standard errors of the mean) at A:Nu = 0.1, in comparison to the fixed state 
model (continuous curves). Plot conventions are the same as in Fig Biophysical parameters used; n = 7, e = 2 ksT, 
/i = 4 ksT. Polymorphisms generally decrease TFBS gain rates. 



FIG. 10: Relaxing the mismatch assumption. Fig but using energy matrices whose nonzero entries are gaussian 
random variables £i, such that {si) = e = 2fesr and (Jg = O.dkBT; n = 7, /r = AksT. The analytical results under the equal 
mismatch assumption are shown in continuous lines. 
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FIG. 11: Single TF binding site loss rates at an isolated DNA region. The dependence of the loss rate, l/(t)w<-fe 
shown in units of point mutation rate, from sequences in different initial mismatch classes k (blue: k = 2, red: k — 0), as a 
function of negative selection strength. Results with point mutations only {9 = 0) are shown by dashed line; with admixture of 
indel mutations {6 = 0.15) by a solid line. For strong selection, \Ns\ ^ 1, the rates scale with 2\Ns\nu, which is captured well 
by the “shortest path” approximation (black dashed lines in the main figure) of Eq ( |24[ ). The biophysical parameters are: site 
length n = 7 bp; binding specificity e = 2 fcsT; chemical potential /i = 4 ksT. Left inset: As-scaling with positive selection. 
Right inset: gain rates as a function of the initial mismatch class k for different Ns. b, c) Loss rates from the consensus 
sequence {k — 0) under strong negative selection, without (b) and with (c) indel mutations supplementing point mutations. 
Red crosshairs denote the cases depicted in panel a). Contour lines show constant loss rates in units of Ns u as a function of 
biophysical parameters n and e. 
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FIG. 12: TFBS evolution in longer sequences. Example simulations (black solid line) and analytic predictions based on 
single TFBS gain/loss rates (black dashed line), for different binding length n and specificity e. Details are identical to Fig. 
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FIG. 13: The effect of ancient sites (a) and cooperativity (b) for different binding lengths and specificities. 

Simulations of TFBS evolution in longer sequences (colored lines) and analytic predictions based on single TFBS gain and loss 
rates (dashed black lines), analogous to Fig.[^ Different panels show different choices of TFBS binding length n and specificity e. 
Ancient sites specifically facilitate the emergence of longer sites of high specificity, whereas cooperativity specifically facilitates 
the emergence of shorter sites of intermediate or low specificity. 
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FIG. 14: Fitness models of interacting TFBSs. The expected number of newly evolved TFBS for binding site length 
n = 7 bp, specificity e = 2 ksT, and chemical potential ij, = A ksT are shown for different fitness models. The solid black curve 
is the non-interacting model nsed in the main text (dashed curve: theoretical prediction). The green cnrve stands for the model 
of Eq (32 I in SI text, where only the strongest binding site in the regulatory sequence determines gene expression. The purple 
curve stands for the model of Eq (331 in SI text, where two strongest TFBS synergistically determine the gene expression level. 
Shading denotes ±2 SEM. The simulations use regulatory sequences of length L = 30 bp (left) and L = 50 bp (right). 
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FIG. 15: Comparison rates of TFBS gain rates and sequence turnover rates within functional TFBSs Average 
first hitting times to particular mismatch kj state can be calculated with a minor modification to Eq ( |21[ ) by replacing S with 
kj. The figures compare the rates of evolution of TFBS within the functional sites (i.e. l/(t)fc=o<-fe=i and l/(t)fc=i<-fc=o). Plot 
conventions are the same as in Fig|^A. Biophysical parameters used: n = 7 bp (left), n = 10 bp (right) e = 2 ksT, ^ — A ksT. 
It shows that for weak selection, the rates to evolve from fc = 0 to fc = 1 can be relatively faster. Also, although adaptation 
from random sites slows down with increasing n, we see that the adaptation rate to evolve from fc = 1 to fc = 0 can stay high. 
























